Voice Generation: Prosody Transfer

Demo

Voice generation involves the creation of voices that best conform to supporting evidence, such as the structure of the skull, or the 3D image/scan of a face. Often, the supporting evidence does not carry the information needed to generate every aspect of the voice signal, including language, content, and style of delivery. This is especially true when the evidence is merely the image of a face or the structure of the skull. As a result, the process poses many challenges, each of which requires careful consideration and its own set of approaches.

One such challenge is that of rendering the correct intonation, or prosody, in the generated (or synthesized) voice signal. Of the many solutions we have explored for this, a good one seems to be that of taking a “style of rendering”, or prosody, from an exemplar voice sample rendered by a human and “transferring” it to the generated voice signal. Note that in this case the goal is simply to emulate the prosody; the content of the generated signal may be different from that of the exemplar.

In the examples below, we show the results of one such mechanism that we have devised to transfer prosody from an exemplar (called a “token”) to the generated signal. The first set, labeled “tokens”, shows 5 exemplars from which we have attempted to “lift” the prosody alone. The next two sets show the results of this process on signals that have the same linguistic content as the exemplar, and on signals that have different content (which is our goal). A paper describing our specific techniques is cited below.

Paper: Gao, Y., Raj, B., Singh, R. (2019) Joint-Attention Learning in Prosody Transfer Speech Synthesis

1. Derivation of prosody from different datasets

Token representations in our model are a way to factorize the prosodies of the training dataset. During the test phase, if we condition the model on a specific token, the synthesized result reflects that token's learned prosody.
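As a rough illustration of what conditioning on a token can look like, the sketch below assumes a style-token-like setup: a small bank of learned token embeddings, one of which is selected at test time and appended to every text-encoder frame before decoding. The names `TOKEN_BANK` and `condition_on_token`, the embedding values, and the toy dimensions are all hypothetical and are not taken from the paper; the actual model may combine tokens and text differently.

```python
from typing import List

# Hypothetical bank of 5 prosody tokens, each a 3-dim embedding.
# In a trained model these values would be learned; here they are made up.
TOKEN_BANK: List[List[float]] = [
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.3, 0.3, 0.4],
    [0.1, 0.0, 0.9],
    [0.5, 0.5, 0.0],
]


def condition_on_token(text_frames: List[List[float]], token_index: int) -> List[List[float]]:
    """Append the chosen token embedding to every text-encoder frame."""
    if not 0 <= token_index < len(TOKEN_BANK):
        raise IndexError(f"token_index must be in [0, {len(TOKEN_BANK) - 1}]")
    token = TOKEN_BANK[token_index]
    # List concatenation builds new frames, so the input is left unchanged.
    return [frame + token for frame in text_frames]


# Toy text-encoder output: 4 frames of 2-dim features (assumed shapes).
text_frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

# Index 2 corresponds to "Token3" in the listings on this page.
conditioned = condition_on_token(text_frames, token_index=2)
print(len(conditioned), len(conditioned[0]))
```

Changing `token_index` while keeping `text_frames` fixed mirrors the listings in this section, where one utterance is rendered with each of five tokens.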

1.1 Different styles and prosodies derived from the VCTK dataset

Utterance: "I’ve felt the chance that I have a number of options."

The audio files are listed below:

Token1: 
Token2: 
Token3: 
Token4: 
Token5: 

1.2 Different prosodies derived from the Beijing dataset

Utterance: "Just recovered a fumble on ensuing kickoff."

The audio files are listed below:

Token1: 
Token2: 
Token3: 
Token4: 
Token5: 

1.3 Different prosodies derived from the Blizzard 2013 dataset

Utterance: "Just recovered a fumble on ensuing kickoff."

The audio files are listed below:

Token1: 
Token2: 
Token3: 
Token4: 
Token5: 

2. Prosody Transfer

In this section, we show the results of prosody transfer from a reference utterance to a test utterance.
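One common way to realize this kind of transfer, sketched here purely under assumptions, is to summarize the reference utterance as attention weights over the prosody tokens and to use their weighted sum as a single fixed-size prosody embedding. Because the embedding does not depend on the length of the text, it can be attached to the same text as the reference or to a different one. The function names, token values, and similarity scores below are hypothetical, and the paper's joint-attention mechanism is not reproduced here.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prosody_embedding(ref_scores: List[float], token_bank: List[List[float]]) -> List[float]:
    """Weighted sum of token embeddings, weighted by attention over tokens."""
    weights = softmax(ref_scores)
    dim = len(token_bank[0])
    return [sum(w * tok[d] for w, tok in zip(weights, token_bank)) for d in range(dim)]


def transfer(text_frames: List[List[float]], embedding: List[float]) -> List[List[float]]:
    """Append the same prosody embedding to every text frame."""
    return [frame + embedding for frame in text_frames]


# Made-up token bank and made-up similarity scores between the
# reference utterance and each token (assumptions for illustration).
token_bank = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
ref_scores = [2.0, 0.0, 0.0]

embedding = prosody_embedding(ref_scores, token_bank)

# Same text as the reference (3 frames) versus a different text (2 frames).
same_text = [[0.1], [0.2], [0.3]]
other_text = [[0.4], [0.5]]
parallel = transfer(same_text, embedding)
non_parallel = transfer(other_text, embedding)
print(len(parallel), len(non_parallel), len(embedding))
```

In this sketch the two cases differ only in the text frames; the prosody embedding derived from the reference is identical in both.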

2.1 Parallel utterances

The following are three examples of prosody transfer synthesis.

In each example, the text of the utterance to synthesize is the same as that of the reference.

Example 1

Utterance text content: My mother always took him to the town on a market day in a light gig.

Reference utterance:
Neutral prosody result:
Prosody Transfer result:

Example 2

Utterance text content: So we never saw Dick any more.

Reference utterance:
Neutral prosody result:
Prosody Transfer result:

Example 3

Utterance text content: You will be to visit me in prison with a basket of provisions, you will not refuse to visit me in prison?

Reference utterance:
Neutral prosody result:
Prosody Transfer result:

2.2 Non-parallel utterances

The following are three examples of non-parallel prosody transfer synthesis.

In each example, the text of the utterance to synthesize is different from that of the reference.

Example 1

The prosody of the non-parallel reference utterance is transferred to synthesis results that have different text content.

Reference utterance: 
Text: My mother always took him to the town on a market day in a light gig.


Prosody Transfer text 1: 
Text: So we never saw Dick any more.


Prosody Transfer text 2: 
Text: Just recovered a fumble on ensuing kickoff.


Example 2

Reference utterance:
Text: You will be to visit me in prison with a basket of provisions, you will not refuse to visit me in prison?


Prosody Transfer text 1:
Text: My mother always took him to the town on a market day in a light gig.


Prosody Transfer text 2:
Text: There was nothing disagreeable in Mister Rushworth's appearance.


Example 3

Reference utterance:
Text: There was nothing disagreeable in Mister Rushworth's appearance, and Sir Thomas was liking him already.


Prosody Transfer text 1:
Text: Just recovered a fumble on ensuing kickoff.


Prosody Transfer text 2:
Text: My mother always took him to the town on a market day in a light gig.